AllLife Bank wants to focus on its credit card customer base in the next financial year. Their marketing research team has advised that penetration in the market can be improved. Based on this input, the Marketing team proposes to run personalised campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery both decide to reach out to the Data Science team for help.
The objective is to identify different segments in the existing customer base, based on their spending patterns as well as their past interactions with the bank.
The data covers various customers of a bank, with their credit limit, the total number of credit cards each customer holds, and the different channels through which the customer has contacted the bank with queries. The channels are visiting the bank in person, contacting the bank online, and calling the call centre.
(this notebook loaded to GitHub at https://github.com/GreatLearningAIML1/gl-pgp-aiml-uta-intl-may20-pierssmith/tree/master/Week%2016%20-%20Project%20-%20Unsupervised%20Learning)
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import pandas_profiling
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.stats import zscore
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import cdist, pdist
sns.reset_orig()
df_abt_CustomerData = pd.read_excel('Credit Card Customer Data.xlsx')
df_abt_CustomerData.head()
df_abt_CustomerData.tail()
df_abt_CustomerData.info()
The dataset contains 660 observations. There are 7 columns, all loaded as numeric (integer), and no null values.
(Requirement 1. Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea about the no of clusters. Perform EDA, create visualizations to explore data.)
(Requirement 2. Properly comment on the codes, provide explanations of the steps taken in the notebook and conclude your insights from the graphs.)
pandas_profiling.ProfileReport(df_abt_CustomerData)
The project brief describes five data points for each record. The dataset contains two additional columns, Sl_No and Customer Key; these are identifiers rather than behavioural features, so they are dropped.
df_abt_CustomerData.drop(['Sl_No', 'Customer Key'], axis=1, inplace=True)
df_abt_CustomerData.describe().transpose()
There are no duplicate rows in the profiling report, and no missing values as noted above, so there are no imputation or duplicate removal steps required.
Avg_Credit_Limit (integer) - the customer's credit limit. The naming of this variable as an 'average' suggests customers in the sample hold more than one product. The data is in the range \$3000 to \\$200000, with no zeroes and no nulls, and the range appears reasonable for credit limits. The median is \$18000 with a mean of \\$34574; the distribution is heavily right skewed (skewness 2.202), as can be observed in the distribution and box plots. This is the only variable where the unit is dollars - the other variables are all counts, as described in the project brief (credit cards and channel contacts) - so scaling will be used before the cluster analysis.
df_abt_CustomerData['Avg_Credit_Limit'].describe()
g1 = sns.distplot(df_abt_CustomerData['Avg_Credit_Limit'])
g1.grid()
g1.set(title='Distribution of average credit limit')
plt.show()
g1 = sns.boxplot(df_abt_CustomerData['Avg_Credit_Limit'])
g1.grid()
g1.set(title='Boxplot of average credit limit')
plt.show()
The boxplot shows many datapoints outside the whiskers due to the right skew of the distribution. Dropping these outliers would mean a significant loss of information for this analysis. Because the purpose of this analysis is clustering, and the variables are measured in different units, zscore scaling (rather than outlier removal) will be the strategy for this analysis.
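As a quick illustration of why zscore scaling suits this situation, the sketch below applies the same `scipy.stats.zscore` transform (used later in this notebook) to a synthetic right-skewed sample; the column name mirrors the real data, but the values are generated for illustration only.

```python
# Sketch with synthetic data: zscore scaling standardises a skewed,
# dollar-denominated column to mean ~0 and standard deviation ~1,
# without dropping any observations.
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(96)
# Lognormal draws mimic a heavily right-skewed credit limit distribution
sample = pd.DataFrame({'Avg_Credit_Limit': rng.lognormal(mean=10, sigma=1, size=500)})

scaled = sample.apply(zscore)

# After scaling, dollar-denominated and count-denominated variables are
# on a comparable scale, while the shape (and the outliers) are preserved.
mean_after = scaled['Avg_Credit_Limit'].mean()
std_after = scaled['Avg_Credit_Limit'].std(ddof=0)
```

Because zscore is a linear transform, the skew and the outlying points remain in the data; only the location and spread are standardised.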
Total_Credit_Cards (integer) - total number of cards. This variable gives the context for 'average' credit limit in previous variable - this sample of customers all hold at least one credit card. Correspondingly, the data is in the range 1 to 10, and there are no missing values and no zeroes. The maximum value is of note - 10 credit cards seems very high, however there are 19 observations with this value, and each value from 1 to 10 is represented, so this does not appear to be an error. The median is 5, and the mean 4.7, and the distribution appears quite normal. Again, this data is on a unique scale, counting the number of a particular product, so the scaling strategy will be used.
df_abt_CustomerData['Total_Credit_Cards'].describe()
g1 = sns.distplot(df_abt_CustomerData['Total_Credit_Cards'])
g1.grid()
g1.set(title='Distribution of total credit cards')
plt.show()
g1 = sns.boxplot(df_abt_CustomerData['Total_Credit_Cards'])
g1.grid()
g1.set(title='Boxplot of total credit cards')
plt.show()
Total_visits_bank (integer) - number of times the customer visited the bank. In the context of the other channels as described in the project brief, this appears to refer to a face-to-face contact in a branch. The range is 0 to 5, with 100 (15.2%) zeroes - in the context of this being the usage count of a particular channel, this suggests many customers don't visit in person. The time period for sampling of this data is unknown, but given two other channel options are captured in this data, and considering contemporary assumptions about customer behaviour, the number of 0 observations does not seem unreasonable. The median is 2, and the mean 2.4, and the distribution normal.
df_abt_CustomerData['Total_visits_bank'].describe()
g1 = sns.distplot(df_abt_CustomerData['Total_visits_bank'])
g1.grid()
g1.set(title='Distribution of channel usage - bank')
plt.show()
g1 = sns.boxplot(df_abt_CustomerData['Total_visits_bank'])
g1.grid()
g1.set(title='Boxplot of channel usage - bank')
plt.show()
Total_visits_online (integer) - number of times the customer has contacted the bank online. The data is in the range 0 to 15, with 144 (21.8%) zeroes, suggesting many customers don't contact online. This is interesting, and in an actual customer engagement it would be interesting to collect additional demographic data to see if there are other factors that might influence channel choice. The range appears reasonable however, given there are a few barriers to online access and the demographic and time period for observations are unknown. The median is 2 and the mean 2.6, and the distribution is heavily right skewed (skewness 2.226), suggesting there are some outlying customers who make extensive use of this channel. Again, scaling is the strategy for this analysis.
df_abt_CustomerData['Total_visits_online'].describe()
g1 = sns.distplot(df_abt_CustomerData['Total_visits_online'])
g1.grid()
g1.set(title='Distribution of channel usage - online')
plt.show()
g1 = sns.boxplot(df_abt_CustomerData['Total_visits_online'])
g1.grid()
g1.set(title='Boxplot of channel usage - online')
plt.show()
Total_calls_made (integer) - number of times the customer contacted the bank via the call centre. The data is in the range 0 to 10, with 97 (14.7%) zeroes, so this is broadly consistent with the pattern in the other two channels, both in terms of the range and the percentage of customers who don't use the channel. As such, the range is reasonable. The median is 3, and the mean 3.6, and the data has a moderate right skew.
df_abt_CustomerData['Total_calls_made'].describe()
g1 = sns.distplot(df_abt_CustomerData['Total_calls_made'])
g1.grid()
g1.set(title='Distribution of channel usage - call centre')
plt.show()
g1 = sns.boxplot(df_abt_CustomerData['Total_calls_made'])
g1.grid()
g1.set(title='Boxplot of channel usage - call centre')
plt.show()
sns.pairplot(df_abt_CustomerData, diag_kind='kde');
Observing the distribution plots together suggests a consistent pattern of around 3-4 potential clusters.
In terms of the off-diagonal analysis, there are some relationships apparent in the channel choices.
This suggests the customers in this cohort who have higher rates of usage of online banking make fewer trips to the branch, and fewer calls. That is, they make more use of self-service channels, and this will be interesting to consider later in the context of the bank's objective to upgrade the service model and ensure queries are resolved faster. If customers who use the other channels can be encouraged to use self-service, that would seem to be a good strategy to resolve query resolution times, and save channel servicing costs at the same time.
Conversely, the customers in this cohort who prefer the branch don't use online or call centre channels as much, so this cohort has a high servicing cost and are likely to be correlated with the group who perceive service response time poorly.
This same pattern also suggests customers have a preferred channel, which is why all the channels are negatively correlated with each other. This again suggests a strategy to resolve servicing times would be to target customers using the branch and call centre channels and move them to self-service.
The relationships between average credit limit and channel usage are also of note. There is a strong positive correlation between higher average credit limits and the use of the online channel, and a negative correlation with the branch and call centre. This suggests customers with higher limits (and presumably higher incomes to service them) make greater use of self-service channels than people with lower limits - again, this is useful insight in the context of the business objectives to improve servicing times. Unsurprisingly, there is a linear relationship apparent between average credit limit and the number of credit cards.
corr = df_abt_CustomerData.corr()
sns.heatmap(corr, annot=True, cmap='Blues');
The correlation matrix reflects the insights from the pair plot.
Before starting the cluster analysis, create a new dataset for analysis which is scaled using zscore.
df_set_ScaledCustomerData = df_abt_CustomerData.apply(zscore)
df_set_ScaledCustomerData.head()
(Requirement 3. Execute K-means clustering use elbow plot and analyse clusters using boxplot)
# Create working copies of the analysis base table (abt) and scaled data
df_working_CustomerData = df_abt_CustomerData.copy()
df_working_ScaledCustomerData = df_set_ScaledCustomerData.copy()
Create an elbow plot to examine some candidate values for k.
clusters=range(1,10)
avg_distortions=[]
for i in clusters:
mod_KMeans = KMeans(n_clusters=i, random_state=96)
mod_KMeans.fit(df_working_ScaledCustomerData)
avg_distortions.append(sum(np.min(cdist(df_working_ScaledCustomerData, mod_KMeans.cluster_centers_, 'euclidean'), axis=1)) / df_working_ScaledCustomerData.shape[0])
plt.plot(clusters, avg_distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
There is a pronounced elbow where k is equal to 3, and another slight elbow where k is equal to 4. This appears consistent with the observations in the bivariate analysis above. As there are two candidates, the strategy for this analysis will be to do two rounds of KMeans using both values of k, and compare the results.
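As a cross-check on the manual cdist-based distortion loop above, scikit-learn's fitted `KMeans` exposes an `inertia_` attribute (the within-cluster sum of squared distances), which produces the same elbow shape. The sketch below uses synthetic blobs rather than the customer data, purely to illustrate the pattern.

```python
# Sketch with synthetic data: KMeans.inertia_ always decreases as k grows;
# the elbow is where the rate of decrease drops sharply (here at the true
# number of generated blobs, k=3).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=96)

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=96).fit(X)
    inertias.append(km.inertia_)
```

Plotting `inertias` against `range(1, 10)` gives the same kind of elbow plot as the average distortion computed manually above.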
mod_KMeans_k3 = KMeans(n_clusters=3, random_state=96)
mod_KMeans_k3.fit(df_working_ScaledCustomerData)
pred_KMeans_k3 = mod_KMeans_k3.predict(df_working_ScaledCustomerData)
df_working_CustomerData["Group"] = pred_KMeans_k3
df_set_KMeansCustomerData_k3 = df_working_CustomerData.groupby(['Group'])
df_set_KMeansCustomerData_k3.mean()
df_working_ScaledCustomerData["Group"] = pred_KMeans_k3
df_working_ScaledCustomerData.boxplot(by='Group', layout = (2,3),figsize=(15,10));
mod_KMeans_k4 = KMeans(n_clusters=4, random_state=96)
mod_KMeans_k4.fit(df_working_ScaledCustomerData)
pred_KMeans_k4 = mod_KMeans_k4.predict(df_working_ScaledCustomerData)
df_working_CustomerData["Group"] = pred_KMeans_k4
df_set_KMeansCustomerData_k4 = df_working_CustomerData.groupby(['Group'])
df_set_KMeansCustomerData_k4.mean()
df_working_ScaledCustomerData["Group"] = pred_KMeans_k4
df_working_ScaledCustomerData.boxplot(by='Group', layout = (2,3),figsize=(15,10));
Using k of 4 has not improved the clustering when evaluated in the context of the business problem. The clusters with the highest and lowest average credit limits have barely changed: the cluster with the highest average credit limit is identical whether k is 3 or 4, and the cluster with the lowest average credit limit varies only by $10 in mean credit limit, with fractional changes in channel usage.
All that has been achieved with k of 4 is splitting the middle group, and not in a useful way. The boxplots show that the average credit limits of the split groups almost completely overlap, and that the only point of differentiation appears to be the number of visits made to the bank. However, visits to the bank already differentiate this cluster when k is 3, as can be seen in the first set of boxplots, and given the bank's objective is to improve servicing times (which would mean targeting channels like in-person visits), splitting this cluster offers no value. k of 3 is the best result.
# Reset the working copy of the scaled customer data to drop the group column
df_working_ScaledCustomerData = df_set_ScaledCustomerData.copy()
# Create a dataframe to hold the score results
df_KMeansSilhouettes = pd.DataFrame(columns=['k', 'silhouette'])
# Will try the two candidates for k tested above, and the k either side
n_clusters = [2, 3, 4, 5]
i = 0
for n_cluster in n_clusters:
mod_KMeans = KMeans(n_clusters=n_cluster, random_state=96)
cluster_labels = mod_KMeans.fit_predict(df_working_ScaledCustomerData)
silhouette_avg = silhouette_score(df_working_ScaledCustomerData, cluster_labels)
df_KMeansSilhouettes.loc[i] = [n_cluster, silhouette_avg]
i+=1
df_KMeansSilhouettes.sort_values(by=['silhouette'], ascending=False, inplace=True)
df_KMeansSilhouettes.reset_index(drop=True,inplace=True)
df_KMeansSilhouettes.head()
A k of 3 produces the best silhouette score of 0.5157. This confirms the result of the elbow plot, the observations in the bivariate analysis, as well as being the best in the context of the business requirements, as per the analysis above.
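To illustrate why the silhouette score is a sensible check on the elbow result, the sketch below computes it for several values of k on synthetic, well-separated blobs (standing in for the scaled customer data); with this geometry the score should peak at the true cluster count.

```python
# Sketch with synthetic data: the silhouette score peaks at the true
# cluster count when the clusters are well separated, mirroring the
# k-selection step used above on the real data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[0, 0], [10, 0], [0, 10]]  # three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.6,
                  random_state=96)

scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=96).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# With this geometry, the highest-scoring k matches the generated count
best_k = max(scores, key=scores.get)
```

Silhouette values range from -1 to 1; higher values mean observations sit close to their own cluster and far from the neighbouring cluster, which is the criterion applied to the real data above.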
(Requirement 4. Execute hierarchical clustering (with different linkages) with the help of dendrogram and cophenetic coeff. Analyse clusters formed using boxplot)
# Create working copies of the analysis base table (abt) and scaled data
df_working_CustomerData = df_abt_CustomerData.copy()
df_working_ScaledCustomerData = df_set_ScaledCustomerData.copy()
df_Cophenetics = pd.DataFrame(columns=['linkage', 'cophenet'])
linkages = ["average", "complete", "ward", "centroid", "single"]
i = 0
for link in linkages:
Z = linkage(df_working_ScaledCustomerData, metric='euclidean', method=link)
c, coph_dists = cophenet(Z , pdist(df_working_ScaledCustomerData))
df_Cophenetics.loc[i] = [link, c]
i+=1
df_Cophenetics.sort_values(by=['cophenet'], ascending=False, inplace=True)
df_Cophenetics.reset_index(drop=True,inplace=True)
df_Cophenetics.head()
The 'average' linkage produces the best cophenetic result.
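For intuition, the cophenetic correlation coefficient measures how faithfully the merge distances in a linkage tree preserve the original pairwise distances; values closer to 1 indicate a better fit. The sketch below repeats the loop above on synthetic data, purely to show the mechanics.

```python
# Sketch with synthetic data: compare linkage methods by their cophenetic
# correlation, the same procedure applied to the scaled customer data above.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(96)
X = rng.normal(size=(50, 4))  # 50 synthetic observations, 4 features

results = {}
for method in ['average', 'complete', 'ward', 'single']:
    Z = linkage(X, metric='euclidean', method=method)
    # cophenet returns the correlation coefficient and the cophenetic distances
    c, _ = cophenet(Z, pdist(X))
    results[method] = c
```

The coefficient is a correlation, so it is bounded by [-1, 1]; the linkage with the highest value (on the real data, 'average') distorts the original distance structure the least.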
Having established the best linkage for this dataset, calculating the silhouette for different values of k will help identify the best number of clusters to use.
df_HierSilhouettes = pd.DataFrame(columns=['k', 'silhouette'])
n_clusters = [2, 3, 4, 5, 6]
i = 0
for n_cluster in n_clusters:
mod_AggClust = AgglomerativeClustering(n_clusters=n_cluster, affinity='euclidean', linkage='average')
cluster_labels = mod_AggClust.fit_predict(df_working_ScaledCustomerData)
silhouette_avg = silhouette_score(df_working_ScaledCustomerData, cluster_labels)
df_HierSilhouettes.loc[i] = [n_cluster, silhouette_avg]
i+=1
df_HierSilhouettes.sort_values(by=['silhouette'], ascending=False, inplace=True)
df_HierSilhouettes.reset_index(drop=True,inplace=True)
df_HierSilhouettes.head()
Whereas the elbow plot and silhouette scores for KMeans clustering both suggested a k of 3, this result suggests that just two clusters produce the best result for hierarchical clustering, in terms of observations being well matched to their own cluster and not well matched to the neighbouring cluster. As k of 3 still produces the second best score, as well as being the result from KMeans, the strategy here will again be to do the clustering in two rounds, with k of 2 and 3, and compare the results.
mod_AggClust_k2 = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average')
mod_AggClust_k2.fit(df_working_ScaledCustomerData)
df_working_CustomerData['Group'] = mod_AggClust_k2.labels_
df_set_HierCustomerData_k2 = df_working_CustomerData.groupby(['Group'])
df_set_HierCustomerData_k2.mean()
df_working_ScaledCustomerData["Group"] = mod_AggClust_k2.labels_
df_working_ScaledCustomerData.boxplot(by='Group', layout = (2,3),figsize=(15,10));
mod_AggClust_k3 = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='average')
mod_AggClust_k3.fit(df_working_ScaledCustomerData)
df_working_CustomerData['Group'] = mod_AggClust_k3.labels_
df_set_HierCustomerData_k3 = df_working_CustomerData.groupby(['Group'])
df_set_HierCustomerData_k3.mean()
df_working_ScaledCustomerData["Group"] = mod_AggClust_k3.labels_
df_working_ScaledCustomerData.boxplot(by='Group', layout = (2,3),figsize=(15,10));
The result for three clusters (k=3) for hierarchical clustering produces a grouping that is very similar to the k=3 result for KMeans clustering. Noting that the cluster with the highest average credit limit is identical across both methods and for the two values of k tested for each one, the effect of setting k=2 in practical terms is just to merge the two clusters that form around the lower credit limits. The boxplots for k=2 show overlaps only in the bank and call centre channel choices, and given these are the two channels that should strategically be the focus for reducing query servicing times, this suggests k=2 may be a good strategy. The boxplots for three clusters as expected are almost identical to the plots for three clusters for KMeans.
Z = linkage(df_working_ScaledCustomerData, metric='euclidean', method='average')
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
The dendrogram shows that for two clusters the effect is to separate out the outliers, which is consistent with two of the patterns observed in the bivariate analysis: first, that the average credit limit distribution was heavily right skewed, and second, that preference for the online channel was highly correlated with average credit limit. This means one cluster forms around higher credit limits and online channel preference, with the second cluster (at k=2) forming between the mean and median credit limit and with bank and call centre channel preferences. The effect of splitting this cluster (when k=3) is to divide the groups by average credit limit again, with some separation around channel preference. This suggests that k=2 and k=3 may BOTH be valuable, but for different business problems. The brief for this analysis was actually to solve two problems - 1) upselling and marketing, and 2) addressing perceived issues with the quality of support services. One of these speaks to understanding groupings of customers around credit usage, and the other speaks to groupings around channel usage. This is considered below.
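With 660 leaves the full dendrogram above is dense; a truncated view showing only the final merges can make the high-level two- and three-cluster structure easier to read. The sketch below uses scipy's `truncate_mode` on synthetic data to illustrate the option.

```python
# Sketch with synthetic data: truncate_mode='lastp' collapses the
# dendrogram to the last p merges, so only the high-level cluster
# structure is drawn.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(96)
X = rng.normal(size=(200, 4))  # 200 synthetic observations

Z = linkage(X, metric='euclidean', method='average')

plt.figure(figsize=(10, 5))
plt.title('Truncated dendrogram (last 12 merges)')
plt.ylabel('Distance')
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90.)
plt.tight_layout()
```

The linkage matrix `Z` always has one row per merge (n - 1 rows for n observations), whether or not the plot is truncated; truncation only changes what is drawn.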
(Requirement 5. Calculate average silhouette score for both methods.)
(Requirement 6. Compare K-means clusters with Hierarchical clusters.)
df_set_KMeansCustomerData_k3.mean()
df_KMeansSilhouettes.head()
df_set_HierCustomerData_k2.mean()
df_HierSilhouettes.head()
Hierarchical clustering at k=2 produced the best silhouette score between the two methods at 0.5703, and hierarchical clustering also scored fractionally better at k=3 (0.5159 vs 0.5157). It is noteworthy that hierarchical clustering at k=2 and KMeans clustering at k=3 produce an identical group at the top end of average credit limit. As noted in the observations above, this suggests that hierarchical clustering at k=3 may actually split in a way that assists the two separate objectives for this business problem.
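Beyond comparing group means side by side, a cross-tabulation of the two sets of labels (and an agreement measure such as the adjusted Rand index) gives a direct, observation-level comparison of the methods. The sketch below demonstrates the idea on synthetic blobs standing in for the scaled customer data.

```python
# Sketch with synthetic data: cross-tabulate KMeans labels against
# hierarchical labels, and score agreement with the adjusted Rand index
# (1.0 = identical partitions up to relabelling, ~0 = random agreement).
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.6, random_state=96)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=96).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)

# Rows are KMeans groups, columns are hierarchical groups; a table that is
# diagonal after relabelling means the two methods found the same segments
crosstab = pd.crosstab(km_labels, hc_labels)
ari = adjusted_rand_score(km_labels, hc_labels)
```

On well-separated clusters both methods recover the same partition, so the cross-tabulation is diagonal (up to label permutation) and the index is near 1; applied to the real labels above, this would quantify how closely the KMeans and hierarchical segments coincide.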
(Requirement 7. Analyse the clusters formed, tell us how is one cluster different from another and answer all the key questions.)
df_set_HierCustomerData_k3.mean()
df_set_HierCustomerData_k2.mean()
Consider the 'key questions' from the problem statement in the project brief, in the context of the stated objectives to (1) target new customers as well as upsell to existing customers, and (2) ensure that customers' queries are resolved faster. Solving for objective 1 means understanding customer behaviour in terms of product usage, and solving for objective 2 means understanding customer channel usage (the thesis being that the quickest way to resolve queries faster is to shift customers away from in-person channels that are constrained in availability, and towards scalable, self-service channels that are available 24/7, like online). With this for context, the three clusters formed at k=3 will be best for objective 1 (market and upsell), and the two clusters formed at k=2 will be best for objective 2 (faster query resolution).
Returning then to the 'key questions'.
For targeting upselling and marketing, there are three segments of customers, which provide two clusters that could be targeted for marketing and upselling. For targeting faster query resolution there are two segments of customers, which provide a single cluster that can be targeted for changing customer behaviour to use more self-service channels.
The effect of hierarchical clustering is that the k=3 clusters effectively split the lower credit limit group into two - this was observed in the dendrogram above. So, taking the top two groups first;
These groups differ markedly in average credit limit, and contact channel choice. The lower credit limit customers favour in-person channels, and the higher credit limit customers favour self-service.
When dividing to three clusters, Group 1 is unchanged, but Group 0 further divides as follows (group names are references to the indexes in the dataframes above);
Group 1, the group of customers who have the highest credit limits, would not be a target for either of the objectives in this campaign - on average they already have a credit limit of \$141,040 and hold an average of 8-9 credit cards, so the opportunities to market and upsell would appear limited. Concurrently, they make the greatest use of the self-service channel, and on average do not visit bank branches and barely use the call centre, so again the opportunities to target this group for improvements to servicing time would appear limited.
Group 0 at k=2 is the group to target for improvements to query resolution time. This group made between 2-3 visits to the bank AND 3-4 calls to the call centre in the observation period, but used the online channel fewer than 2 times. This cluster should be targeted for an education, marketing, or incentive based campaign to increase the uptake of the online channel, so that more of their contacts are resolved using self-service instead of in-person contacts - this will increase the speed of query resolution simply through behaviour change.
Finally, Group 2 at k=3 is the group to target for marketing and upselling. This group has an average credit limit of only \$12197 and holds an average of 2-3 cards, suggesting an opportunity to target this group for marketing new credit products. Of note is this group's strong channel preference for the call centre, followed by online - this suggests a channel-based approach focussing on the call centre and online advertising (as opposed to in-branch promotions) will be the most effective in reaching these customers.